About the Dataset

This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

Load libraries

library(tidyverse)
library(GGally)
library(caret)
library(moments)
library(corrplot)
library(ggmap)
library(arules)
library(xgboost)

Data import and Preparation

Import the data and examine the data type of each feature

dataset_path <- "/Users/truongminh/Downloads/SchoolWork/UW_Data_Analytics/Data_Mining/Final_Project"

houses <- read.delim(file.path(dataset_path,"kc_house_data.csv"),sep = ",", header = TRUE)

houses %>% glimpse()
## Rows: 21,613
## Columns: 21
## $ id            <dbl> 7129300520, 6414100192, 5631500400, 2487200875, 1954400…
## $ date          <fct> 20141013T000000, 20141209T000000, 20150225T000000, 2014…
## $ price         <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 257500…
## $ bedrooms      <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2…
## $ bathrooms     <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2…
## $ sqft_living   <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 18…
## $ sqft_lot      <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470…
## $ floors        <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, …
## $ waterfront    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ view          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0…
## $ condition     <int> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4…
## $ grade         <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, …
## $ sqft_above    <int> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 18…
## $ sqft_basement <int> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0,…
## $ yr_built      <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2…
## $ yr_renovated  <int> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ zipcode       <int> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198,…
## $ lat           <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561, 4…
## $ long          <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -122.…
## $ sqft_living15 <int> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 2…
## $ sqft_lot15    <int> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113,…

Exploratory Data Analysis

Examine the shape of numeric variables

We will first examine the distribution of each numeric column using histograms. These histograms show that very few numeric features are normally distributed, which may reduce the reliability of the model, given that it uses a linear algorithm.
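The histogram chunk itself is hidden with echo = FALSE, so here is one plausible sketch of it, assuming the `houses` data frame from the import step (the binning in the original figure is unknown):

```r
library(tidyverse)

# Reshape every numeric feature into long format and facet one histogram per feature
houses %>%
  select(where(is.numeric), -id) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "value") %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ feature, scales = "free")
```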

Checking for correlation among variables with the target variable using correlation matrix

corr_matrix <- houses %>% select(-c(lat,long,renovated,waterfront)) %>% cor()
corr_matrix %>% corrplot(method = "pie")

From this correlation map of the numeric features, we can see that features relating to size seem to influence the price of the house the most, namely “bedrooms”, “bathrooms”, “sqft_living”, “sqft_lot”, “sqft_above”, “sqft_living15”, and “grade”. However, we need to keep in mind that these features are also correlated with one another, so some of them need to be filtered out to avoid multicollinearity.

  • Unexpectedly, age contributes very little to the price of the residence.
# Feature extraction by filtering
houses <- houses %>% select(-c(sqft_living15, sqft_lot15, sqft_above,age))

Further explore the relationship between size features and price

Let’s further explore the three discrete variables: floors, bedrooms, and bathrooms. Among the three, the number of bathrooms appears to have the most significant impact on the price of the house.
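The chunk producing that comparison is hidden, so the exact chart type is unknown; one plausible sketch (assuming `houses` is still loaded) treats each discrete feature as a factor and compares price distributions:

```r
library(tidyverse)

# Compare price against each discrete size feature, one boxplot panel per feature
houses %>%
  select(price, floors, bedrooms, bathrooms) %>%
  pivot_longer(-price, names_to = "feature", values_to = "count") %>%
  ggplot(aes(factor(count), price)) +
  geom_boxplot() +
  facet_wrap(~ feature, scales = "free_x") +
  labs(x = NULL)
```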

Visualizing price by categorical variables

Seeing how price differs across each variable grouping using violin plots

Each of these features has a visible effect on price. View seems to be the feature whose effect on price is the least distinctive.
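Since the violin-plot chunk is hidden with echo = FALSE, here is a sketch of what it might look like, assuming `houses` and the usual categorical features (the original may group different columns):

```r
library(tidyverse)

# Violin plot of price at each level of the categorical/ordinal features
houses %>%
  select(price, waterfront, view, condition, grade) %>%
  pivot_longer(-price, names_to = "feature", values_to = "level") %>%
  ggplot(aes(factor(level), price)) +
  geom_violin() +
  facet_wrap(~ feature, scales = "free_x") +
  labs(x = NULL)
```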

Visualizing price by location

Intuitively, it makes sense that the location of a house largely influences its price as well. Let’s visualize the distribution of house prices on a map, using longitude and latitude as coordinates.
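The original chunk (hidden with echo = FALSE) presumably draws a basemap with ggmap; a lighter-weight sketch that shows the same spatial pattern, assuming `houses` still has `lat` and `long`, is to plot each sale at its coordinates coloured by log price:

```r
library(tidyverse)

# Approximate the map: one point per sale, brighter colour = higher price
houses %>%
  ggplot(aes(long, lat, color = log10(price))) +
  geom_point(alpha = 0.3, size = 0.5) +
  scale_color_viridis_c(name = "log10(price)")
```

Colouring by log10(price) rather than raw price keeps a few multi-million-dollar sales from washing out the colour scale.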

We can see a pattern in how the higher-priced houses are distributed: there is a high concentration of expensive houses in the Ballard area, followed by West Seattle, Downtown Seattle, Mercer Island, and Bellevue. Another interesting observation is that many of the darker areas are around waterfront housing.

Discretizing geographical values

It’s clear that longitude and latitude influence house price. However, the current format of these variables makes it difficult to extract meaningful information, so I’m going to discretize them and group them into categories.

houses$vert_loc <- discretize(houses$lat,
                                    method = "frequency",
                                    breaks = 3,
                                    labels = c("South","Mid","North"))

houses$horz_loc <- discretize(houses$long,
                                     method = "frequency",
                                     breaks = 3,
                                     labels = c("West","Mid","East"))

houses <- houses %>% select(-c(lat,long))

Preprocessing

Transforming numerical features

From examining the histograms in the previous section, we came to the conclusion that most of the numeric features are highly skewed and do not follow a normal distribution. Let’s check their skewness.

num_feat <- houses %>% select(-c(waterfront, renovated, vert_loc, horz_loc)) %>% names()
apply(houses[num_feat],2,skewness) # table showing the skewness of each variable
##         price      bedrooms     bathrooms   sqft_living      sqft_lot 
##     4.0237899     1.9741625     0.5110721     1.4714533    13.0591125 
##        floors          view     condition         grade sqft_basement 
##     0.6161340     3.3955139     1.0327330     0.7710497     1.5778555

There are two highly skewed variables in this dataset: price and sqft_lot. To improve prediction accuracy, let’s convert both to a log scale so that their distributions are closer to Gaussian.

# Convert skewed data to log scale
houses <- houses %>% mutate(sqft_lot = log10(sqft_lot),
                    price = log10(price))

Encoding categorical features

horz_loc <- model.matrix(~horz_loc - 1, data = houses)
vert_loc <- model.matrix(~vert_loc - 1, data = houses)
waterfront <- model.matrix(~waterfront - 1, data = houses)

houses <- cbind(houses %>% select(-c(horz_loc, vert_loc, waterfront)), horz_loc, vert_loc, waterfront)

Split and Scale data

train <- createDataPartition(y = houses$price, p = 0.7, list = FALSE)

houses_train <- houses[train,]
houses_test <- houses[-train,]

prepro <- preProcess(houses_train[num_feat],method = c("nzv","center","scale"))

houses_train <- predict(prepro,houses_train)
houses_test <- predict(prepro,houses_test)

Modeling

For this regression problem, I’m going to try two methods: ridge regression and an XGBoost regression tree. Ridge is a regularized regression method known for its simplicity. It penalizes model complexity, adjusting the bias-variance trade-off: the model has lower accuracy on the training data but is much less likely to overfit, so it exhibits lower variance on unseen data.

set.seed(1808)
ridge_model <- train(price ~ .,
                     data = houses_train,
                     method = "ridge",
                     tuneGrid = expand.grid(lambda = seq(0, 1, 0.05)),
                     trControl = trainControl(method = "cv", 10))

xgboost_model <- train(price ~ .,
                      data = houses_train,
                      method = "xgbTree",
                      trControl = trainControl("cv", number = 10))

Model evaluation

Validating

Compare MAE, RMSE and Rsquared score after cross validation of each model

compare <- resamples(list(xg_boost = xgboost_model,
                          ridge_regression = ridge_model))

bwplot(compare)

Cross-validation scores on the training data show the xg_boost model yielding slightly better accuracy. XGBoost is well known for its accuracy, but its big drawback is a tendency to overfit. That’s the main reason I decided to use ridge regression as my backup model: it compensates for this shortcoming of XGBoost in return for lower accuracy.

Let’s see how each model performs on the held-out test dataset.

Testing

compare_on_train <- resamples(list(ridge = ridge_model,
                                   xgboost = xgboost_model))

compare_on_test <- data.frame(row.names = c('RMSE','RSquared','MAE'))

compare_on_test['ridge_model'] <- postResample(predict(ridge_model,houses_test),houses_test$price)
compare_on_test['xgboost_model'] <- postResample(predict(xgboost_model,houses_test),houses_test$price)

bwplot(compare_on_train)

compare_on_test
##          ridge_model xgboost_model
## RMSE       0.4773049     0.4362115
## RSquared   0.7682076     0.8062938
## MAE        0.3671914     0.3262627

XGBoost still outperforms the ridge model on unseen data, which once again demonstrates its effectiveness. After examining their performance on the test data, it’s safe to settle on the XGBoost model.

Conclusion

In conclusion, with an R-squared of about 0.8, the XGBoost model seems to be a good choice for predicting house prices. However, it is worth mentioning that the set of variables used in these models is much larger than the variables given in the original problem.

Note that the echo = FALSE parameter was added to some code chunks to prevent printing of the R code that generated the plots.